Abstract

We provide a general theoretical analysis of expected out-of-sample utility, also referred to as decision-theoretic classification, for non-decomposable binary classification metrics such as F-measure and Jaccard coefficient. Our key result is that the expected out-of-sample utility for many performance metrics is provably optimized by a classifier which is equivalent to a signed thresholding of the conditional probability of the positive class. Our analysis bridges a gap in the literature on binary classification, revealed in light of recent results for non-decomposable metrics in population utility maximization style classification. Our results identify checkable properties of a performance metric which are sufficient to guarantee a probability ranking principle. We propose consistent estimators for optimal expected out-of-sample classification. As a consequence of the probability ranking principle, computational requirements can be reduced from exponential to cubic complexity in the general case, and further reduced to quadratic complexity in special cases. We provide empirical results on simulated and benchmark datasets evaluating the performance of the proposed algorithms for decision-theoretic classification and comparing them to baseline and state-of-the-art methods in population utility maximization for non-decomposable metrics.


1 Introduction 


Many binary classification metrics in popular use, such as $F_\beta$ and Jaccard, are non-decomposable, which indicates that the utility of a classifier evaluated on a set of examples cannot be decomposed into the sum of the utilities of the classifier applied to each example. In contrast, decomposable metrics such as accuracy evaluated on a set of examples can be decomposed into a sum of per-example accuracies. Non-decomposability of a performance metric is often desirable as it enables a non-linear tradeoff between the overall confusion matrix entries: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). As a result, non-decomposable performance metrics remain popular for imbalanced and rare event classification in medical diagnosis, fraud detection, information retrieval applications [Lewis and Gale, 1994, Drummond and Holte, 2005, Gu et al., 2009, He and Garcia, 2009], and in other problems where the practitioner is interested in measuring tradeoffs beyond standard classification accuracy.

A recent flurry of theoretical results and practical algorithms highlights a growing interest in understanding and optimizing non-decomposable metrics [Dembczynski et al., 2011, Ye et al., 2012, Koyejo et al., 2014, Narasimhan et al., 2014]. Existing theoretical analysis has focused on two distinct approaches for characterizing the population version of the non-decomposable metrics, identified by Ye et al. [2012] as decision theoretic analysis (DTA) and empirical utility maximization (EUM). DTA population utilities measure the expected gain of a classifier on a fixed-size test set, while EUM population utilities are a function of the population confusion matrix. In other words, DTA population utilities measure the average utility over an infinite set of test sets, each of a fixed size, while EUM population utilities evaluate the performance of a classifier over a single infinitely large test set.

It has recently been shown that for EUM based population utilities, the optimal classifier for large classes of non-decomposable binary classification metrics is just the sign of the thresholded conditional probability of the positive class with a metric-dependent threshold [Koyejo et al., 2014, Narasimhan et al., 2014]. In addition, practical algorithms have been proposed for such EUM consistent classification based on direct optimization for the threshold on a held-out validation set. In stark contrast to this burgeoning understanding of EUM optimal classification, we are aware of only two metrics for which DTA consistent classifiers have been derived and shown to exhibit a simple form; namely, the $F_\beta$ metric [Lewis, 1995, Dembczynski et al., 2011, Ye et al., 2012] and squared error in counting (SEC) studied by Lewis [1995].


In this paper, we seek to bridge this gap in the binary classification literature, and provide a general theoretical analysis of DTA population utilities for non-decomposable binary classification metrics. Interestingly, we show that for many metrics the DTA optimal classifier again comprises signed thresholding of the conditional probability of the positive class. As we show, for a metric to have such an optimal classifier it must obey the so-called probability ranking principle (PRP), which was first formalized by Lewis [1995] in the information retrieval context. We identify a sufficient condition (a certain monotonicity property) for a metric to obey PRP. We show that this condition is satisfied by large families of binary performance metrics including the monotonic family studied by Narasimhan et al. [2014], and a large subset of the linear fractional family studied by Koyejo et al. [2014]. We also recover known results for the special cases of $F_\beta$ and SEC.

While the optimal classifiers of both EUM and DTA population utilities associated with the performance metrics we study comprise signed thresholding of the conditional probability of the positive class, the evaluation and optimization for EUM and DTA utilities require quite different techniques. Given a classifier and a distribution, evaluating a population DTA utility can involve exponential-time computation, even leaving aside maximizing the utility on a fixed test set. As we show, in light of the probability ranking principle, and with careful implementation, this can actually be reduced to cubic complexity. These computations can be further reduced to quadratic complexity in a few special cases [Ye et al., 2012]. To this end, we propose two algorithms for optimal DTA classification. The first algorithm runs in $O(n^3)$ time for a general metric, where $n$ is the size of the test set, and the second algorithm runs in time $O(n^2)$ for special cases such as $F_\beta$ and Jaccard. We show that our overall procedure for decision-theoretic classification is consistent.


Related Work: A full literature survey on binary classification is beyond the scope of this manuscript. We focus instead on some key related results. It is well known that classification accuracy is optimized by thresholding the conditional probability of the positive class at half. Bartlett et al. [2006] showed how convex surrogates could be constructed in order to control the probability of misclassification. This work was extended by Steinwart [2007] to construct surrogates for asymmetric or weighted binary accuracy. $F_\beta$ is perhaps the most studied of the non-decomposable performance metrics. For instance, Joachims [2005] proposed a support vector machine for directly optimizing the empirical $F_\beta$. Lewis [1995] analyzed the expected $F_\beta$ measure, showing that it satisfied the probability ranking principle. Based on this result, several authors have proposed algorithms for empirical optimization of the expected $F_\beta$ measure including Chai [2005], Jansche [2005] and Cheng et al. [2010] who studied probabilistic classifier chains. Ye et al. [2012] compared the optimal expected out-of-sample utility and the optimal training population utility for $F_\beta$, showing an asymptotic equivalence as the number of test samples goes to infinity. More recently, Parambath et al. [2014] gave a theoretical analysis of the binary and multi-label $F_\beta$ measure in the EUM setting. Dembczynski et al. [2011] analyzed the $F_\beta$ measure in the DTA setting including the case where the data is non i.i.d., and also proposed efficient algorithms for optimal classification.


2 Preliminaries 


Let $X \in \mathcal{X}$ represent instances and $Y \in \{0,1\}$ represent labels. We assume that the instances and labels are generated iid as $X, Y \sim \mathbb{P}$ for some fixed unknown distribution $\mathbb{P} \in \mathcal{P}$. This paper will focus on non-decomposable performance metrics that are general functions of the entries of the confusion matrix, namely true positives, true negatives, false positives and false negatives. Let bold $\mathbf{x}$ denote a set of $n$ instances $\{x_1, x_2, \ldots, x_n\}$ drawn from $\mathcal{X}$, and $\mathbf{y} \in \{0,1\}^n$ denote the associated labels. Given a vector of predictions $\mathbf{s} \in \{0,1\}^n$ for instances $\mathbf{x}$, the empirical confusion matrix is computed as $C(\mathbf{s},\mathbf{y}) = \begin{bmatrix} TP & FN \\ FP & TN \end{bmatrix}$ with entries:

$$TP(\mathbf{s},\mathbf{y}) = \frac{1}{n}\sum_{i=1}^{n} s_i y_i, \qquad TN(\mathbf{s},\mathbf{y}) = \frac{1}{n}\sum_{i=1}^{n} (1-s_i)(1-y_i),$$

$$FP(\mathbf{s},\mathbf{y}) = \frac{1}{n}\sum_{i=1}^{n} s_i (1-y_i), \qquad FN(\mathbf{s},\mathbf{y}) = \frac{1}{n}\sum_{i=1}^{n} (1-s_i) y_i.$$

To simplify notation, we will omit the arguments when they are clear from context, e.g. TP instead of $TP(\mathbf{s},\mathbf{y})$.

Let $\Psi : [0,1]^4 \mapsto \mathbb{R}_+$ denote a non-decomposable metric evaluated on the entries of the confusion matrix. We will sometimes use the abbreviated notation $\Psi(\mathbf{s},\mathbf{y}) := \Psi(C(\mathbf{s},\mathbf{y}))$ or $\Psi(C) := \Psi(C(\mathbf{s},\mathbf{y}))$ depending on context. By non-decomposable, we mean that $\Psi$ does not decouple as a sum over individual instances $s_i, y_i$. The DTA $\Psi$-utility of $\mathbf{s}$ wrt. $\mathbb{P}$ is defined as:

$$\mathcal{U}^{\Psi}(\mathbf{s}; \mathbb{P}) = \mathbb{E}_{\mathbf{y} \sim \mathbb{P}(\cdot|\mathbf{x})}\, \Psi(\mathbf{s},\mathbf{y}) \tag{1}$$

For the rest of this manuscript, utility will refer to the DTA utility unless otherwise noted.

Note that the development above considered the set of classifier responses $\mathbf{s} \in \{0,1\}^n$ for a given set $\mathbf{x}$ of $n$ input instances. More generally, we are interested in a classifier $\theta : \mathcal{X} \mapsto \{0,1\}$, and given a marginal distribution on $\mathcal{X}$, the expected utility of any such classifier $\theta(\cdot)$ can be computed as $\mathbb{E}_{\mathbf{x}}\left[\mathcal{U}^{\Psi}(\mathbf{s}; \mathbb{P}(\cdot|\mathbf{x}))\right]$, where $s_i = \theta(x_i)$. Since the optimal classifier for the expected utility must also optimize pointwise at each $\mathbf{x}$, it is sufficient to analyze the pointwise utility directly. Consequently, we will focus on this quantity for the remainder of the manuscript.

We are thus interested in obtaining the optimal classifier given by:

$$\mathbf{s}^* = \arg\max_{\mathbf{s} \in \{0,1\}^n} \mathcal{U}^{\Psi}(\mathbf{s}; \mathbb{P}). \tag{2}$$

Remark 1 (EUM Utility). Fix a classifier $\theta : \mathcal{X} \mapsto \{0,1\}$ and a distribution $\mathbb{P} \in \mathcal{P}$, and let $C(\theta, \mathbb{P}) = \begin{bmatrix} TP & FN \\ FP & TN \end{bmatrix}$ represent the population confusion matrix with entries:

$$TP = \mathbb{P}(\theta(x) = 1, y = 1), \qquad FP = \mathbb{P}(\theta(x) = 1, y = 0),$$

$$TN = \mathbb{P}(\theta(x) = 0, y = 0), \qquad FN = \mathbb{P}(\theta(x) = 0, y = 1).$$

EUM utility [Koyejo et al., 2014, Narasimhan et al., 2014] is computed as $\Psi(C(\theta, \mathbb{P}))$, i.e. in contrast to the DTA utility, $\Psi$ is applied to the population confusion matrix.


Our analysis will utilize the probability ranking principle (PRP), first formalized by Lewis [1995] as a property of the metric $\Psi$ that identifies when the optimal classifier is related to the ordered conditional probabilities of the positive class.

Definition 2 (Probability Ranking Principle (PRP) [Lewis, 1995]). Let $\Psi$ denote a performance metric. We say that $\Psi$ satisfies PRP if for any set $\mathbf{x}$ of $n$ input instances, and any distribution $\mathbb{P}(\cdot|\mathbf{x})$, the optimum $\mathbf{s}^*$ of the utility (1) with respect to $\mathbb{P}(\cdot|\mathbf{x})$ satisfies:

$$\min\{\mathbb{P}(y_i = 1|x_i) \mid s_i^* = 1\} \geq \max\{\mathbb{P}(y_i = 1|x_i) \mid s_i^* = 0\}.$$

Let sign : $\mathbb{R} \mapsto \{0,1\}$ be defined as sign$(t) = 1$ if $t \geq 0$ and sign$(t) = 0$ otherwise. The following corollary is immediate.

Corollary 3. Let $\Psi$ be a metric for which PRP holds, and let $\mathbf{x}$ denote a set of $n$ iid instances sampled from the marginal of a distribution $\mathbb{P}$. The optimal predictions for any such $\mathbf{x}$ are given by the classifier $s_i^* = \theta^*(x_i) = \mathrm{sign}(\mathbb{P}(Y = 1|x_i) - \delta^*)$ where $\delta^* \in [0,1]$ may depend on $\mathbf{x}$.
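As a reading aid, the PRP condition and the top-$k$ form of the resulting predictions can be sketched in a few lines. The helper names are hypothetical, `eta` stands for the conditional probabilities $\mathbb{P}(y_i = 1|x_i)$, and ties between a predicted positive and a predicted negative are treated as allowed, which is an assumption of this sketch.

```python
def satisfies_prp(s, eta):
    """True if every predicted positive has eta at least as large as every predicted negative."""
    pos = [e for si, e in zip(s, eta) if si == 1]
    neg = [e for si, e in zip(s, eta) if si == 0]
    if not pos or not neg:
        return True  # the condition is vacuous when one side is empty
    return min(pos) >= max(neg)


def top_k_predictions(eta, k):
    """Label the k instances with the largest eta as positive (ties broken by index)."""
    order = sorted(range(len(eta)), key=lambda i: -eta[i])
    s = [0] * len(eta)
    for i in order[:k]:
        s[i] = 1
    return s
```

Any vector returned by `top_k_predictions` satisfies the check, so under PRP the search for an optimum can be restricted to these $n + 1$ candidates.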


Lewis [1995] showed that PRP holds for a specific non-decomposable measure of practical interest, the $F_\beta$-measure; a similar result was also shown for the squared error in counting (SEC), which is designed to measure the squared difference between the true and the predicted number of positives.


Theorem (Lewis [1995]).

1. PRP holds for $F_\beta$ defined as:

$$\Psi_{F_\beta}(C) = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP} \tag{3}$$

2. PRP holds for SEC defined as:

$$\Psi_{SEC}(C) = (p - v)^2 = (FN - FP)^2,$$

where $p := \frac{1}{n}\sum_i y_i = TP + FN$ and $v := \frac{1}{n}\sum_i s_i = TP + FP$.

3 PRP for General Performance Metrics 

PRP is a meaningful property for any performance metric since, as a consequence of Corollary 3, any metric satisfying PRP admits an optimal classifier with a simple form. In this section, we identify sufficient conditions for a metric $\Psi$ to satisfy PRP. To begin, we consider the following equivalent representation for any metric $\Psi$.

Proposition 4. Let $u = TP(\mathbf{s},\mathbf{y})$, $v = v(\mathbf{s}) := \frac{1}{n}\sum_i s_i$ and $p = p(\mathbf{y}) := \frac{1}{n}\sum_i y_i$, then $\exists\, \Phi : [0,1]^3 \mapsto \mathbb{R}_+$ such that:

$$\Psi(C(\mathbf{s},\mathbf{y})) = \Phi(TP(\mathbf{s},\mathbf{y}), v(\mathbf{s}), p(\mathbf{y})). \tag{4}$$

Next, we consider a certain monotonicity property which we have observed is satisfied by popular binary classification metrics.

Definition 5 (TP Monotonicity). A metric $\Psi$ is said to be TP monotonic if when $u_1 > u_2$ and $v, p$ fixed, it follows that $\Phi(u_1, v, p) > \Phi(u_2, v, p)$.

In other words, $\Psi$ satisfies TP monotonicity if the corresponding representation $\Phi$ (Proposition 4) is monotonically increasing in its first argument.

For any $\Psi$, TP monotonicity may be verified by applying the representation of Proposition 4. It is easy to verify, for instance, that $\Phi_{F_\beta}(u,v,p) = \frac{(1+\beta^2)u}{\beta^2 p + v}$ is monotonic in $u$. Our analysis will show that the TP monotonicity property is sufficient to guarantee that $\Psi$ satisfies PRP. The proof is provided in Appendix A.1.

Theorem 6 (Main Result 1). The probability ranking principle holds for any $\Psi$ that satisfies TP monotonicity.
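A numerical sanity check of TP monotonicity can be sketched as follows. It is not a proof: it only verifies that a candidate representation $\Phi$ is non-decreasing in $u$ over the feasible confusion counts for a small hypothetical test-set size `n`. The function names and the 0/0 convention are assumptions of this sketch.

```python
def phi_fbeta(u, v, p, beta=1.0):
    """F_beta written in terms of u = TP, v = TP + FP, p = TP + FN."""
    denom = beta ** 2 * p + v
    return (1 + beta ** 2) * u / denom if denom > 0 else 0.0


def is_tp_monotonic_on_grid(phi, n=8):
    """Check that phi is non-decreasing in u over all feasible count triples for n examples."""
    for kv in range(n + 1):            # number of predicted positives
        for kp in range(n + 1):        # number of actual positives
            lo = max(0, kv + kp - n)   # true positives are bounded below because TN >= 0
            hi = min(kv, kp)           # and above by both marginals
            values = [phi(ku / n, kv / n, kp / n) for ku in range(lo, hi + 1)]
            if any(b < a for a, b in zip(values, values[1:])):
                return False
    return True
```

A metric that decreases in TP, such as the hypothetical `lambda u, v, p: -u`, fails the check, while `phi_fbeta` passes it.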


While TP monotonicity of $\Psi$ is sufficient for PRP to hold, it is not necessary. For instance, consider the subclass of performance metrics where $\Phi(\cdot, v, p)$ is independent of the first argument, i.e. independent of TP. SEC is an example of a performance metric in this family with $\Phi_{SEC}(TP, v, p) = (v - p)^2$. The following proposition shows that such metrics also satisfy PRP.

Proposition 7. Let $\Psi = \Phi(TP, v, p)$ be a performance metric independent of TP, then $\Psi$ satisfies PRP.

Proof. Suppose $\Phi(\cdot, v, p)$ is independent of its first argument. Let $\mathbf{s}^*$ be an optimal classifier, with $v^* = v(\mathbf{s}^*)$. If $\mathbf{s}^*$ does not satisfy PRP, then sort $\mathbf{s}^*$ with respect to $\mathbb{P}(Y = 1|x_i)$ to obtain a new classifier $\mathbf{s}$. It is clear that $v(\mathbf{s}^*) = v(\mathbf{s})$, and $\Phi(\cdot, v(\mathbf{s}^*), p) = \Phi(\cdot, v(\mathbf{s}), p)$, so $\mathbf{s}$ is also an optimal classifier which satisfies PRP. □


3.1 Recovered and New Results 


This section outlines a few examples of known and new results recovered via the application of Theorem 6, which include a subset of the fractional linear family of Koyejo et al. [2014] and the family of performance metrics studied by Narasimhan et al. [2014].

The Fractional Linear Family: Koyejo et al. [2014] studied a large family of performance metrics, and showed that their EUM optimal classifiers are given by the thresholded sign of the conditional probability of the positive class. This family contains, for example, the $F_\beta$ and Jaccard measures. The family $\Psi_{FL}$ is equivalently represented by:

$$\Phi_{FL}(TP(\mathbf{s},\mathbf{y}), v(\mathbf{s}), p(\mathbf{y})) = \frac{c_0 + c_1 TP + c_2 v + c_3 p}{d_0 + d_1 TP + d_2 v + d_3 p} \tag{5}$$

for bounded constants $c_i, d_i$, $i \in \{0,1,2,3\}$. Our analysis identifies a subclass of this family that satisfies PRP. The following result can be proven by inspection and is stated without proof.


Proposition 8. If $c_1 > d_1$, then $\Psi_{FL}$ satisfies TP monotonicity.

Performance Metrics from Narasimhan et al. [2014]: An alternative three-parameter representation of metrics $\Psi$ was studied by Narasimhan et al. [2014] as described in the following proposition.

Proposition 9 (Narasimhan et al. [2014]). Let $p = p(\mathbf{y}) := \frac{1}{n}\sum_i y_i$, $r_p = TPR(\mathbf{s},\mathbf{y}) = \frac{TP}{p}$ and $r_n = TNR(\mathbf{s},\mathbf{y}) = \frac{TN}{1-p}$, then $\exists\, \Gamma : [0,1]^3 \mapsto \mathbb{R}_+$ such that:

$$\Psi(C(\mathbf{s},\mathbf{y})) = \Gamma(TPR(\mathbf{s},\mathbf{y}), TNR(\mathbf{s},\mathbf{y}), p(\mathbf{y})). \tag{6}$$


As shown in Table 1, many performance metrics used in practice are easily represented in this form. Representation for additional metrics is simplified by including the empirical precision, given by $Prec(\mathbf{s},\mathbf{y}) = \frac{TP}{v}$, where $v(\mathbf{s}) := \frac{1}{n}\sum_i s_i = TP + FP$ can be computed from the quantities in Proposition 9. Consider the following monotonicity property relevant to the representation in Proposition 9.
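The relationship between the $(u, v, p)$ representation of Proposition 4 and the rates used here can be sketched as below. The helper names are hypothetical, and returning `None` for undefined ratios is a convention chosen only for this illustration.

```python
def rates_from_uvp(u, v, p):
    """Map (u, v, p) = (TP, TP + FP, TP + FN) to (TPR, TNR, Prec).

    Ratios with a zero denominator are returned as None.
    """
    tn = 1.0 - v - p + u                      # TN = 1 - TP - FP - FN
    tpr = u / p if p > 0 else None
    tnr = tn / (1.0 - p) if p < 1 else None
    prec = u / v if v > 0 else None
    return tpr, tnr, prec


def am_from_uvp(u, v, p):
    """AM = (TPR + TNR) / 2 evaluated through the (u, v, p) representation."""
    tpr, tnr, _ = rates_from_uvp(u, v, p)
    return (tpr + tnr) / 2.0
```

With $u = 0.3$, $v = 0.5$, $p = 0.4$, the value of `am_from_uvp` agrees with the closed form $\frac{u + p(1-v-p)}{2p(1-p)}$.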


Definition 10 (TPR/TNR Monotonicity). A metric $\Psi$ is said to be TPR/TNR monotonic if when $r_{p1} > r_{p2}$ and $r_{n1} > r_{n2}$ and $p$ fixed, it follows that $\Gamma(r_{p1}, r_{n1}, p) > \Gamma(r_{p2}, r_{n2}, p)$.

In other words, $\Psi$ satisfies TPR/TNR monotonicity if the corresponding representation $\Gamma$ (Proposition 9) is monotonically increasing in its first two arguments. It can be shown that all the measures listed in Table 1 satisfy TPR/TNR monotonicity. Further, Narasimhan et al. [2014] showed that given additional smoothness conditions on $\mathbb{P}$, the associated metrics $\Gamma$ admit an optimal EUM classifier with the familiar signed thresholded form.

The following proposition shows that any performance metric that satisfies TPR/TNR monotonicity also satisfies TP monotonicity. Thus, TP monotonicity is a weaker condition. The proof is provided in Appendix A.2.

Proposition 11. If $\Psi$ satisfies TPR/TNR monotonicity, then $\Psi$ satisfies TP monotonicity.

It follows from Corollary 3 that any metric that satisfies TPR/TNR monotonicity admits a DTA optimal classifier that takes the familiar signed-threshold form. We can verify from the third column of Table 1 that each of the TPR/TNR monotonic measures $\Phi(u,v,p)$ is monotonically increasing in $u$.

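Remark 12. TP monotonicity is a strictly weaker condition than TPR/TNR monotonicity. Consider the following counterexample, where $\Psi(\mathbf{s},\mathbf{y}) = 2TP(\mathbf{s},\mathbf{y}) + FP(\mathbf{s},\mathbf{y})$ with equivalent representation given by $\Phi(\mathbf{s},\mathbf{y}) = TP(\mathbf{s},\mathbf{y}) + v(\mathbf{s})$ and $\Gamma(\mathbf{s},\mathbf{y}) = 2p(\mathbf{y})TPR(\mathbf{s},\mathbf{y}) - (1 - p(\mathbf{y}))TNR(\mathbf{s},\mathbf{y}) - p + 1$. Clearly $\Psi$ is TP monotonic, but not TPR/TNR monotonic.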
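The two representations in Remark 12 follow from the identities $TP = p \cdot TPR$ and $FP = (1-p)(1 - TNR)$. The short derivation below is written out only as a reading aid and uses no quantities beyond those already defined.

```latex
\begin{align*}
\Psi &= 2\,TP + FP = TP + (TP + FP) = u + v, \\
\Psi &= 2p\,TPR + (1-p)(1 - TNR) = 2p\,TPR - (1-p)\,TNR - p + 1.
\end{align*}
```

The first line is increasing in $u$ for fixed $v$, while in the second line the coefficient of $TNR$ is $-(1-p) \leq 0$, so $\Gamma$ does not increase in its second argument.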


4 Algorithms 

In this section, we present efficient algorithms for computing DTA optimal predictions for a given set of instances $\mathbf{x}$ and a non-decomposable performance measure $\Psi$ that satisfies PRP. We also examine the consistency of the proposed algorithms. A priori, solving (2) is NP-hard. The key consequence of Theorem 6 is that we do not have to search over $2^n$ possible label vectors to compute the optimal predictions. In light of Corollary 3, it suffices to consider $n + 1$ prediction vectors that correspond to selecting the top $k$ instances as positive, after sorting them by $\mathbb{P}(y = 1|x)$, for some $k$. Even when $\mathbb{P}(y = 1|x)$ is known exactly, it is not obvious how to compute the expectation in (1) without exhaustively enumerating $\mathbf{y}$ vectors. We now turn to address these computational questions.

Table 1: Performance metrics for which the probability ranking principle (PRP) holds. The third column expresses each measure $\Psi(\mathbf{s},\mathbf{y})$ as $\Phi(TP, v(\mathbf{s}), p(\mathbf{y}))$.

| Metric | Definition | $\Phi(u, v, p)$ |
|---|---|---|
| AM | $(TPR + TNR)/2$ | $\frac{u + p(1-v-p)}{2p(1-p)}$ |
| $F_\beta$ | $\frac{1+\beta^2}{\frac{1}{Prec} + \frac{\beta^2}{TPR}}$ | $\frac{(1+\beta^2)u}{\beta^2 p + v}$ |
| Jaccard | $TP/(TP + FP + FN)$ | $\frac{u}{p + v - u}$ |
| G-TP/PR | $\sqrt{TPR \cdot Prec}$ | $\frac{u}{\sqrt{p\,v}}$ |
| G-Mean | $\sqrt{TPR \cdot TNR}$ | $\sqrt{\frac{u(1-v-p+u)}{p(1-p)}}$ |
| H-Mean | $\frac{2}{\frac{1}{TPR} + \frac{1}{TNR}}$ | $\frac{2u(1-v-p+u)}{(1-v-p)p + u}$ |
| Q-Mean | $1 - \frac{1}{2}\left((1-TPR)^2 + (1-TNR)^2\right)$ | $1 - \frac{1}{2}\left(\left(\frac{p-u}{p}\right)^2 + \left(\frac{v-u}{1-p}\right)^2\right)$ |


4.1 $O(n^3)$ Algorithm for PRP Measures


Ye et al. [2012] suggest a simple trick to compute the expectation in $O(n^3)$ time for the $F_\beta$-measure. We make the observation that by evaluating $\Psi$ through $\Phi$, we can essentially use the same trick to obtain a cubic-time algorithm to solve (2) for general measures $\Psi$ satisfying the probability ranking principle. Consider the vector $\mathbf{s} \in \{0,1\}^n$ with the top $k$ values set to 1 and the rest to 0, and let $S_{i:j} := \sum_{l=i}^{j} y_l$. Note that for any $\mathbf{y} \in \{0,1\}^n$ that satisfies $S_{1:k} = k_1$ and $S_{k+1:n} = k_2$, $\Psi(\mathbf{s},\mathbf{y})$ can simply be evaluated as $\Phi\left(\frac{k_1}{n}, \frac{k}{n}, \frac{k_1 + k_2}{n}\right)$. Thus $\mathcal{U}^{\Psi}(\mathbf{s}; \mathbb{P}) = \sum_{\mathbf{y} \in \{0,1\}^n} \mathbb{P}(\mathbf{y}|\mathbf{x})\Psi(\mathbf{s},\mathbf{y})$ can be evaluated as a sum over possible values of $k_1$ and $k_2$, where the expectation is computed wrt. $\mathbb{P}(S_{1:k} = k_1)\mathbb{P}(S_{k+1:n} = k_2)$ with $0 \leq k_1 \leq k$ and $0 \leq k_2 \leq n - k$. Now, it remains to compute $\mathbb{P}(S_{1:k} = k_1)$ and $\mathbb{P}(S_{k+1:n} = k_2)$ efficiently.

Let $\eta_i = \mathbb{P}(y_i = 1|x_i)$. A consistent estimate of this quantity may be obtained by minimizing a strongly proper loss function such as logistic loss [Reid and Williamson, 2009]. Using the iid assumption on the draw of labels, we can show that $\mathbb{P}(S_{1:k} = k_1)$ and $\mathbb{P}(S_{k+1:n} = k_2)$ are the coefficients of $z^{k_1}$ and $z^{k_2}$ in $\prod_{j=1}^{k}[\eta_j z + (1 - \eta_j)]$ and $\prod_{j=k+1}^{n}[\eta_j z + (1 - \eta_j)]$ respectively, each of which can be computed in time $O(n^2)$ for fixed $k$. Note that the metric $\Psi$ can be evaluated in constant time. The resulting $O(n^3)$ algorithm is presented in Algorithm 1. The overall method is as follows:

1. First, obtain an estimate of $\eta_i = \mathbb{P}(y_i = 1|x_i)$, e.g. via logistic regression.

2. Re-order indices in the descending order of estimated $\eta_i$'s.

3. Then, invoke Algorithm 1 with the sorted $\eta_i$'s to compute $\mathbf{s}^*$.
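The steps above can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: it assumes the estimates `eta` are already available, represents the metric by a user-supplied `phi(u, v, p)`, uses a hypothetical Jaccard representation as the example metric, and, like the algorithm described in the text, only considers $k = 1, \ldots, n$.

```python
def poly_coeffs(etas):
    """Coefficients of prod_j (eta_j z + 1 - eta_j): entry i is P(sum of labels = i)."""
    coeffs = [1.0]
    for e in etas:
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += (1.0 - e) * c      # label 0 keeps the count
            nxt[i + 1] += e * c          # label 1 increments the count
        coeffs = nxt
    return coeffs


def phi_jaccard(u, v, p):
    """Jaccard in the (u, v, p) representation; 0/0 is set to 0 (an assumption)."""
    denom = p + v - u
    return u / denom if denom > 0 else 0.0


def dta_optimal_cubic(eta, phi):
    """Return (s, utility) maximizing the expected phi over top-k predictions, k = 1..n."""
    n = len(eta)
    order = sorted(range(n), key=lambda i: -eta[i])   # indices by descending eta
    sorted_eta = [eta[i] for i in order]
    best_k, best_val = 0, float("-inf")
    for k in range(1, n + 1):
        C = poly_coeffs(sorted_eta[:k])    # P(S_{1:k} = k1)
        D = poly_coeffs(sorted_eta[k:])    # P(S_{k+1:n} = k2)
        val = sum(C[k1] * D[k2] * phi(k1 / n, k / n, (k1 + k2) / n)
                  for k1 in range(k + 1) for k2 in range(n - k + 1))
        if val > best_val:
            best_k, best_val = k, val
    s = [0] * n
    for i in order[:best_k]:
        s[i] = 1
    return s, best_val
```

Each value of $k$ costs $O(n^2)$ for the coefficient products and the double sum, which gives the cubic total. The returned vector is indexed in the original (unsorted) order of `eta`.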


Algorithm 1 Computing $\mathbf{s}^*$ for PRP $\Psi$

1: Input: $\Psi$ and estimates of $\eta_i$ for instances $x_i$ with indices $i = 1, 2, \ldots, n$ sorted wrt. $\eta_i$
2: Init $s_i^* = 0$, $\forall i \in [n]$.
3: for $k = 1$ to $n$ do
4:   For $0 \leq i \leq k$, set $C_k[i]$ as the coefficient of $z^i$ in $\prod_{j=1}^{k}(\eta_j z + (1 - \eta_j))$.
5:   For $0 \leq i \leq n - k$, set $D_k[i]$ as the coefficient of $z^i$ in $\prod_{j=k+1}^{n}(\eta_j z + (1 - \eta_j))$.
6:   $\Psi_k \leftarrow \sum_{0 \leq k_1 \leq k,\; 0 \leq k_2 \leq n-k} C_k[k_1] D_k[k_2]\, \Phi\left(\frac{k_1}{n}, \frac{k}{n}, \frac{k_1 + k_2}{n}\right)$.
7: end for
8: Set $k^* \leftarrow \arg\max_k \Psi_k$ and $s_i^* \leftarrow 1$ for $i \in [k^*]$.
9: return $\mathbf{s}^*$


4.2 $O(n^2)$ Algorithm for a Subset of Fractional-Linear Metrics

We focus our attention on the fractional-linear family of non-decomposable performance metrics studied by Koyejo et al. [2014]. Recall that a fractional-linear metric can be represented by $\Phi_{FL}$ as given in (5). As shown in Proposition 8, $\Psi_{FL}$ satisfies TP monotonicity when $c_1 > d_1$. For certain measures in the $\Psi_{FL}$ family, we can get a more efficient algorithm for solving (2). In particular, when $c_3 = 0$ in (5), we can give a quadratic-time procedure for computing $\mathbf{s}^*$ that generalizes the method proposed by Ye et al. [2012] when the constants $\{d_0, d_1, d_2, d_3\}$ are rational. Formally, we consider the sub-family of TP monotonic fractional-linear metrics:

$$\left\{\Psi_{SFL} : \Phi_{SFL}(u,v,p) = \frac{c_0 + c_1 u + c_2 v}{d_0 + d_1 u + d_2 v + d_3 p},\; c_1 > d_1, \text{ and } d_0, d_1, d_2, d_3 \text{ are rational}\right\}. \tag{7}$$

Consider Step 6 of Algorithm 1 for a measure in family (7):

$$\Psi_k \leftarrow \sum_{0 \leq k_1 \leq k} C_k[k_1](c_0 n + c_1 k_1 + c_2 k) \sum_{0 \leq k_2 \leq n-k} \frac{D_k[k_2]}{d_0 n + (d_1 + d_3)k_1 + d_2 k + d_3 k_2}.$$

Define $b(k, a) = \sum_{0 \leq k_2 \leq n-k} D_k[k_2]/(a + d_3 k_2)$. Verify that $b(n, a) = 1/a$. From the fact that $D_{k-1}[i] = \eta_k D_k[i-1] + (1 - \eta_k) D_k[i]$, it follows that:

$$b(k-1, a) = \eta_k\, b(k, a + d_3) + (1 - \eta_k)\, b(k, a).$$

Now, when the $d_i$'s are rational, i.e. $d_i = q_i/r_i$, the above induction can be implemented using an array to store the values of $b$, for possible values of $a$. The resulting $O(n^2)$ algorithm is presented in Algorithm 2. Algorithm 2 applies to the $F_\beta$ as well as the Jaccard measure listed in Table 1.
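To illustrate the recursion, the following sketch specializes it to $F_1$, for which $\Phi(u,v,p) = 2u/(p+v)$, so that in counts the summand is $2k_1/(k + k_1 + k_2)$ and all constants are integers. It is a simplified illustration under that assumption, not the general rational-coefficient Algorithm 2; the function name is hypothetical, and the prefix coefficient tables are stored for all $k$, which uses $O(n^2)$ memory.

```python
def f1_optimal_quadratic(eta):
    """Return (s, expected F1) over top-k predictions, k = 1..n, in O(n^2) time."""
    n = len(eta)
    order = sorted(range(n), key=lambda i: -eta[i])
    h = [eta[i] for i in order]                    # h[k-1] is eta_k in sorted order

    # prefix[k][k1] = P(S_{1:k} = k1), built incrementally in O(n^2) total.
    prefix = [[1.0]]
    for e in h:
        prev = prefix[-1]
        cur = [0.0] * (len(prev) + 1)
        for i, c in enumerate(prev):
            cur[i] += (1.0 - e) * c
            cur[i + 1] += e * c
        prefix.append(cur)

    # b[a] holds b(k, a) = sum_{k2} P(S_{k+1:n} = k2) / (a + k2); b(n, a) = 1 / a.
    b = [0.0] + [1.0 / a for a in range(1, 2 * n + 1)]
    best_k, best_val = 0, float("-inf")
    for k in range(n, 0, -1):
        C = prefix[k]
        val = sum(C[k1] * 2.0 * k1 * b[k + k1] for k1 in range(1, k + 1))
        if val > best_val:
            best_k, best_val = k, val
        # Step from b(k, .) to b(k-1, .) for a = 1 .. n + k - 1 (ascending a,
        # so b[a + 1] still holds the level-k value when it is read).
        e = h[k - 1]
        for a in range(1, n + k):
            b[a] = e * b[a + 1] + (1.0 - e) * b[a]
    s = [0] * n
    for i in order[:best_k]:
        s[i] = 1
    return s, best_val
```

For the two-instance example with probabilities 0.9 and 0.2, the expected $F_1$ of predicting only the first instance positive is 0.84, against roughly 0.673 for predicting both positive, so the sketch returns the former.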


Algorithm 2 Computing $\mathbf{s}^*$ for $\Psi_{SFL}$ in the family (7)

1: Input: Estimates $\eta_i$ for instances $x_i$, $i = 1, 2, \ldots, n$ sorted wrt. $\eta_i$, and $c_0, c_1, c_2$, $d_i = q_i/r_i$, $i = 0, 1, 2, 3$ corresponding to $\Phi_{SFL}$
2: Init $s_i^* = 0$, $\forall i \in [n]$.
3: Set $j_0 \leftarrow r_1 r_2 r_3 q_0$, $j_{u,1} \leftarrow r_0 r_2 r_3 q_1$, $j_{u,2} \leftarrow r_0 r_1 r_2 q_3$, $j_v \leftarrow r_0 r_1 r_3 q_2$
4: for $1 \leq i \leq (|j_{u,1}| + |j_{u,2}| + |j_v|)n$ do
5:   set $S[i] \leftarrow r_0 r_1 r_2 r_3/(i + j_0 n)$.
6: end for
7: for $k = n$ to $1$ do
8:   For $0 \leq i \leq k$, set $C_k[i]$ as the coefficient of $z^i$ in $\prod_{j=1}^{k}(\eta_j z + (1 - \eta_j))$.
9:   $\Psi_{SFL;k} \leftarrow \sum_{0 \leq k_1 \leq k} (c_0 n + c_1 k_1 + c_2 k)\, C_k[k_1]\, S[(j_{u,1} + j_{u,2})k_1 + j_v k]$.
10:   for $i = 1$ to $(|j_{u,1}| + |j_{u,2}| + |j_v|)(k - 1)$ do
11:     $S[i] \leftarrow (1 - \eta_k) S[i] + \eta_k S[i + j_{u,2}]$.
12:   end for
13: end for
14: Set $k^* \leftarrow \arg\max_k \Psi_{SFL;k}$ and $s_i^* \leftarrow 1$ for $i \in [k^*]$.
15: return $\mathbf{s}^*$

Correctness of Algorithm 2: When $d_3 \neq 0$, at line 7 of Algorithm 2 we can verify that $S[i] = b(k, (i + j_0 n)d_3/j_{u,2})$, and therefore at line 9, $S[(j_{u,1} + j_{u,2})k_1 + j_v k] = b(k, ((j_{u,1} + j_{u,2})k_1 + j_v k + j_0 n)d_3/j_{u,2}) = b(k, (d_1 + d_3)k_1 + d_2 k + d_0 n)$ as desired. When $d_3 = 0$, $b(k, a) = b(k - 1, a)$ for all $1 \leq k \leq n$. Let $q_3 = 0$ and $r_3 = 1$. Then, line 5 sets $S[i] = r_0 r_1 r_2/(i + j_0 n)$, line 11 maintains this invariant as $j_{u,2} = 0$ in this case, and therefore at line 9, $S[(j_{u,1} + j_{u,2})k_1 + j_v k] = 1/(d_1 k_1 + d_2 k + d_0 n)$ as desired.

Consistency: Consider a procedure that maximizes the utility $\mathcal{U}^{\Psi}(\mathbf{s}; \hat{\mathbb{P}}(\mathbf{y}|\mathbf{x}))$ computed with respect to a consistent estimate $\hat{\mathbb{P}}(\mathbf{y}|\mathbf{x})$ of the probability $\mathbb{P}(\mathbf{y}|\mathbf{x})$. Here, we show that any such procedure is consistent. The proof is provided in Appendix B.1.

Theorem 13. Let $\eta(x_i) = \mathbb{P}(y = 1|x_i)$, and assume the estimate $\hat{\eta}(x_i)$ satisfies $\hat{\eta}(x_i) \xrightarrow{p} \eta(x_i)$. Given a bounded performance metric $\Psi$ and a fixed test set of size $n$, let $\mathbf{s}^* = \arg\max_{\mathbf{s} \in \{0,1\}^n} \mathcal{U}^{\Psi}(\mathbf{s}; \mathbb{P}(\mathbf{y}|\mathbf{x}))$ be the utility optimal prediction with respect to $\mathbb{P}$ and $\hat{\mathbf{s}} = \arg\max_{\mathbf{s} \in \{0,1\}^n} \mathcal{U}^{\Psi}(\mathbf{s}; \hat{\mathbb{P}}(\mathbf{y}|\mathbf{x}))$ be the utility optimal prediction with respect to the consistent estimate $\hat{\mathbb{P}}(\mathbf{y}|\mathbf{x})$, then

$$\mathcal{U}^{\Psi}(\mathbf{s}^*; \mathbb{P}) - \mathcal{U}^{\Psi}(\hat{\mathbf{s}}; \mathbb{P}) \xrightarrow{p} 0.$$

As stated in Theorem 13, consistency of DTA utility maximization with empirical probability estimates does not depend on PRP. Thus, the consistency results also apply to previous algorithms proposed for $F_\beta$, e.g. by Lewis [1995], Chai [2005], Jansche [2005], Ye et al. [2012], that did not include an analysis of consistency with empirical probability estimates. In the special case of TP monotonic performance metrics, the following corollary, which follows directly from Theorem 13, shows that Algorithm 1 and Algorithm 2 are consistent.

Corollary 14. Assume the estimate $\hat{\eta}(x)$ satisfies $\hat{\eta}(x) \xrightarrow{p} \eta(x)$ and the performance metric $\Psi$ is TP monotonic. For a fixed test set of size $n$, let $\hat{\mathbf{s}}$ denote the output of Algorithm 1 (or Algorithm 2 where applicable) using the empirical estimate $\hat{\eta}(x_i)$. Then

$$\mathcal{U}^{\Psi}(\mathbf{s}^*; \mathbb{P}) - \mathcal{U}^{\Psi}(\hat{\mathbf{s}}; \mathbb{P}) \xrightarrow{p} 0,$$

where $\mathbf{s}^*$ is the optimal prediction computed with respect to the true distribution $\eta(x_i) = \mathbb{P}(y = 1|x_i)$.


5 Experiments 

We present two sets of experiments. The first is an experimental validation on synthetic data with known ground truth probabilities. The results serve to verify the probability ranking principle (Theorem 6) for some of the metrics in Table 1. The second set is an experimental evaluation of DTA optimal classifiers on benchmark datasets, and includes a comparison to EUM optimal classifiers and standard empirical risk minimization with a fixed threshold of 1/2, designed to optimize classification accuracy.

Figure 1: PRP of metrics from Table 1 demonstrated on synthetic data. In each case, we verify that $\mathbf{s}^*$ is obtained by thresholding $\eta(x_i)$ at a fixed value. Furthermore, different measures are optimized at different thresholds on $\mathbf{x}$ from the same distribution $\mathbb{P}$.



5.1 Synthetic data: PRP for general metrics 

We consider four metrics from Table 1, namely AM, Jaccard, $F_1$ (harmonic mean of Precision and Recall) and G-TP/PR (geometric mean of Precision and Recall), which satisfy PRP from Theorem 6. To simulate, we sample a set of ten 2-dimensional vectors $\mathbf{x} = \{x_1, x_2, \ldots, x_{10}\}$ from the standard Gaussian. The conditional probability is modeled using a sigmoid function: $\eta_i = \mathbb{P}(y_i = 1|x_i) = \frac{1}{1 + \exp(-w^\top x_i)}$, for a random vector $w$ also sampled from the standard Gaussian. The optimal predictions $\mathbf{s}^*$ that maximize the DTA objective (2) are then obtained by exhaustive search over the $2^{10}$ possible label vectors. For each metric, we plot the conditional probabilities (in decreasing order) and $\mathbf{s}^*$ in Figure 1. We observe that PRP holds in each case (Algorithms 1 and 2 produce identical results; plots not shown).
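The data-generating step of this simulation can be sketched as follows. The function name, the seed handling, and the default arguments are assumptions made for illustration; only the sampling scheme (standard Gaussian instances and weight vector, sigmoid link) comes from the description above.

```python
import math
import random


def simulate_conditionals(n=10, dim=2, seed=0):
    """Sample n Gaussian instances and a Gaussian weight vector; return (xs, w, eta)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    # eta_i = sigmoid(w . x_i) is the conditional probability of the positive class.
    eta = [1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
           for x in xs]
    return xs, w, eta
```

The resulting `eta` values can then be passed to an exhaustive search or to a top-$k$ procedure to compare the optimal predictions.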


5.2 Benchmark data: Evaluation of the proposed algorithms 


We perform DTA classification using the proposed approach: (i) obtain a model for the conditional distribution $\eta(x) = \mathbb{P}(y = 1|x)$ using training data and (ii) compute $\mathbf{s}^*$ for the test data using estimated conditionals in the proposed Algorithms 1 and 2. We use logistic loss on the training samples (with $L_2$ regularization) to obtain an estimate $\hat{\eta}(x)$ of $\mathbb{P}(y = 1|x)$. In our experiments, we consider the four performance metrics AM, $F_1$, Jaccard and G-TP/PR. For AM and G-TP/PR we use Algorithm 1 while for the fractional-linear metrics Jaccard and $F_1$ we use the more efficient Algorithm 2. Let $\mathbf{y}^*$ denote the true labels for the test data. We report the achieved held-out utility $\Psi(C(\mathbf{s}^*, \mathbf{y}^*))$.

We compare DTA classification using the aforementioned metrics with that of the EUM classifiers using the corresponding metrics as discussed in Remark 1. We use the plugin-estimator method proposed by Koyejo et al. [2014] and Narasimhan et al. [2014], where the optimal classifier is given by sign$(\hat{\eta}(x) - \delta)$. The training data is split into two sets, one set is used for estimating $\hat{\eta}(x)$ and the other for selecting the optimal $\delta$. The predictions are then made by thresholding $\hat{\eta}(x)$ of the test data points at $\delta$. We also compare to the baseline method of thresholding $\hat{\eta}(x)$ at 1/2.
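The threshold-selection step of the EUM plug-in comparison can be sketched as below. This is a minimal illustration, not the cited authors' code: the function names are hypothetical, $F_1$ is used as the example metric, candidate thresholds are the observed held-out estimates, and instances exactly at the threshold are labeled positive.

```python
def f1_score(s, y):
    """Empirical F1 of binary predictions s against labels y (0 when TP = 0)."""
    tp = sum(si * yi for si, yi in zip(s, y))
    fp = sum(si * (1 - yi) for si, yi in zip(s, y))
    fn = sum((1 - si) * yi for si, yi in zip(s, y))
    return 2.0 * tp / (2.0 * tp + fp + fn) if tp > 0 else 0.0


def select_threshold(eta_val, y_val, metric=f1_score):
    """Pick the threshold delta maximizing the metric on held-out (eta, y) pairs."""
    best_delta, best_val = 0.5, float("-inf")
    for delta in sorted(set(eta_val)):       # each observed value gives one split
        s = [1 if e >= delta else 0 for e in eta_val]
        val = metric(s, y_val)
        if val > best_val:
            best_delta, best_val = delta, val
    return best_delta


def predict(eta_test, delta):
    """Threshold estimated conditional probabilities at delta."""
    return [1 if e >= delta else 0 for e in eta_test]
```

The baseline corresponds to calling `predict(eta_test, 0.5)` without any selection step.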

We reporf resulfs on seven benchmark dafasefs (used in [Koveio ef al.l 12014], lYe ef all 120121]'). fli 


Reut ers, consisting of 8293 news articles categorized info 65 fopics. Following IlYe ef al.Ll2012LlKovejo ef al.L 
2014] . we presenf resulfs for averaging over fopics fhaf had af leasf T positives in fhe fraining (5946 articles) 
as well as fhe fesf (2347 arficles) dafa; (2) Letters dafasef consisting of 20000 handwriffen characfers 
(16000 fraining and 4000 fesf insfances) categorized info 26 leffers; (3) SCENE (a UCI benchmark dafasef) 
consisting of 2230 images (1137 fraining and 1093 fesf insfances) categorized info 6 scene fypes; (4) WEB¬ 
PAGE binary dafasef, consisting of 34780 web pages (6956 frain and 27824 fesf); highly imbalanced, wifh 
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Dataset 

T 

DTA 

Baseline 

EUM 

DTA 

Baseline 

EUM 



Fi 

Fi 

Fi 

Jaccard 

Jaccard 

Jaccard 


1 

0.5875 

0.5151 

0.4980 

0.4761 

0.4308 

0.4257 

Reuters 

10 

0.8247 

0.7624 

0.7599 

0.6801 

0.6409 

0.6910 

(65) 

50 

0.8997 

0.8428 

0.8510 

0.7515 

0.7448 

0.7578 


100 

0.9856 

0.9675 

0.9669 

0.9398 

0.9375 

0.9357 

Betters (26) 

1 

0.7110 

0.4827 

0.5745 

0.4272 

0.3632 

0.4318 

Scene (6) 

1 

0.9626 

0.6891 

0.5916 

0.3540 

0.0206 

0.2080 

Web page 

1 

0.8394 

0.6269 

0.6267 

0.4637 

0.5215 

0.5194 

Spambase 

1 

0.9636 

0.8798 

0.8892 

0.7314 

0.7867 

0.8003 

Image 

1 

0.9578 

0.8571 

0.8581 

0.7455 

0.7500 

0.7623 

Breast Cancer 

1 

0.9793 

0.9589 

0.9766 

0.9342 

0.9211 

0.9481 


Table 2: Comparison of methods: Linear-fractional metrics, Fi and Jaccard. Baseline refers to thresholding 
fj{x) at 0.5; DTA refers to the proposed me t hod o f computing s* using Algorithm |2l and BUM refers to 
the plugin-estimator method in iKoveio et all i2014l] . First three are multi-class datasets (number of classes 
indicated in parenthesis): metric is computed individually for each class that has at least T positive instances 
(in both the train and the test sets) and then averaged over classes. 


only about 182 positive instances in the train set; (5) IMAGE, with 1300 train and 1010 test images; (6) Breast
Cancer, with 463 train and 220 test instances; and (7) Spambase, with 3071 train and 1530 test instances.¹
The results for the $F_1$ and Jaccard metrics (using Algorithm 2 for DTA) are presented in Table 2. We find that
the DTA classifier, which optimizes the threshold with respect to the test instances, often improves the utility
compared to the baseline or the EUM style of using a threshold selected with training data. The results for
the AM and G-TP/PR metrics (using Algorithm 1 for DTA) are presented in Table 3. In this case, while choosing
a threshold other than 1/2 helps, there is no clear winner between the DTA and the EUM approaches.
Overall, our results are consistent with the literature, which suggests that threshold optimization results in
improved performance. DTA utility optimization outperforms the baselines on some metrics, and results
in performance comparable to EUM on others. Additional empirical study is planned for future work.
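
To make the difference between the thresholding strategies concrete, the baseline and an EUM-style rule can be sketched as below. This is a simplified stand-in, assuming estimated probabilities are available as plain lists: `tune_threshold` picks the training-set score that maximizes empirical $F_1$, which only approximates the plug-in procedure of Koyejo et al. [2014], and all function names are hypothetical.

```python
def f1(pred, truth):
    """Empirical F1 for 0/1 lists; returns 0 when there are no positives (assumed convention)."""
    tp = sum(s * y for s, y in zip(pred, truth))
    denom = sum(pred) + sum(truth)
    return 2 * tp / denom if denom > 0 else 0.0


def threshold_predict(probs, thresh):
    """Label an instance positive when its estimated probability is at least `thresh`."""
    return [1 if p >= thresh else 0 for p in probs]


def tune_threshold(train_probs, train_labels, metric=f1):
    """EUM-style rule: choose the threshold with the best metric value on training data."""
    best_t = 0.5
    best_val = metric(threshold_predict(train_probs, best_t), train_labels)
    # Only thresholds equal to an observed score can change the predictions.
    for t in sorted(set(train_probs)):
        val = metric(threshold_predict(train_probs, t), train_labels)
        if val > best_val:
            best_t, best_val = t, val
    return best_t
```

The baseline corresponds to `threshold_predict(test_probs, 0.5)`, while the EUM-style rule reuses the threshold returned by `tune_threshold` on the test set. The DTA approach differs from both in that it chooses the labelling from the test-set probability estimates themselves.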


6 Conclusions and Future Work 


The goal of this paper is to bridge a gap in the binary classification literature, between empirical utility
maximization (EUM) and decision theoretic analysis. In particular, our analysis shows that many popular
metrics satisfy a probability ranking principle, so the DTA optimal classifier is given by the signed
thresholding of the conditional probability of the positive class. This result matches a similar analysis in the EUM
literature.

We propose a TP monotonicity property for metrics, which if satisfied is sufficient to guarantee that the
metric satisfies the probability ranking principle. We show that TP monotonicity is satisfied by large families
of binary performance metrics, including the monotonic family studied by Narasimhan et al. [2014] and a
large subset of the linear fractional family studied by Koyejo et al. [2014]. We also recover known results for
the special cases of $F_\beta$ and SEC. We propose efficient and consistent estimators for optimal expected
out-of-sample classification. In particular, we show that as a consequence of the probability ranking principle,


¹ See Koyejo et al. [2014], Ye et al. [2012] for more details on the datasets.
Dataset         T     DTA      Baseline   EUM      DTA       Baseline   EUM
                      AM       AM         AM       G-TP/PR   G-TP/PR    G-TP/PR
Reuters (65)    1     0.8834   0.7223     0.7733   0.7289    0.5447     0.5299
                10    0.9520   0.8360     0.9111   0.8066    0.7800     0.8076
                50    0.9659   0.9017     0.9582   0.8495    0.8441     0.8691
                100   0.9783   0.9761     0.9781   0.9687    0.9675     0.9672
Letters (26)    1     0.8715   0.7020     0.8720   0.5787    0.5064     0.5902
Scene (6)       1     0.5840   0.5065     0.5810   0.5069    0.0605     0.3848
Web page        1     0.8689   0.8205     0.8750   0.6617    0.6867     0.6886
Spambase        1     0.8780   0.9010     0.9090   0.8494    0.8831     0.8913
Image           1     0.8041   0.8192     0.8069   0.8676    0.8577     0.8702
Breast Cancer   1     0.9796   0.9661     0.9830   0.9660    0.9590     0.9734

Table 3: Comparison of methods: AM and G-TP/PR metrics. Baseline refers to thresholding $\hat{\eta}(x)$ at 0.5;
DTA refers to the proposed method of computing $\mathbf{s}^*$ using Algorithm 1, and EUM refers to the
plugin-estimator method in Narasimhan et al. [2014]. The first three are multi-class datasets (number of classes
indicated in parentheses): the metric is computed individually for each class that has at least T positive instances (in
both the train and the test sets) and then averaged over classes.


computational requirements can be reduced from exponential to cubic complexity in the general case, and 
further reduced to quadratic complexity in special cases. 
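
As an illustration of why the probability ranking principle removes the exponential search, the sketch below evaluates only the n + 1 "top-k" candidates for $F_1$, assuming conditionally independent labels with estimated probabilities `eta`. It is not the paper's algorithm: it is a plain dynamic-programming variant written for this example, with roughly cubic cost in n, and every name in it is hypothetical. The value 0 for $F_1$ when there are no predicted and no actual positives is also an assumed convention.

```python
def count_distribution(probs):
    """Distribution of the number of positives among independent Bernoulli labels."""
    dist = [1.0]
    for q in probs:
        new = [0.0] * (len(dist) + 1)
        for c, mass in enumerate(dist):
            new[c] += mass * (1.0 - q)
            new[c + 1] += mass * q
        dist = new
    return dist


def expected_f1_top_k(eta_sorted, k):
    """Expected F1 when the k highest-probability instances are predicted positive."""
    inside = count_distribution(eta_sorted[:k])   # number of true positives
    outside = count_distribution(eta_sorted[k:])  # number of false negatives
    total = 0.0
    for tp, p_in in enumerate(inside):
        for fn, p_out in enumerate(outside):
            denom = k + tp + fn  # predicted positives + actual positives
            if denom > 0:
                total += p_in * p_out * (2.0 * tp / denom)
    return total


def best_top_k(eta):
    """Search the n + 1 rankings-consistent labellings instead of all 2^n."""
    order = sorted(range(len(eta)), key=lambda i: -eta[i])
    eta_sorted = [eta[i] for i in order]
    values = [expected_f1_top_k(eta_sorted, k) for k in range(len(eta) + 1)]
    k_star = max(range(len(values)), key=values.__getitem__)
    s = [0] * len(eta)
    for i in order[:k_star]:
        s[i] = 1
    return s, values[k_star]
```

Each candidate k costs on the order of n^2 operations (two count distributions plus a double sum), so scanning all k is cubic, compared with the 2^n labellings of an exhaustive search.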

The similarity between the DTA optimal and EUM optimal classifiers suggests a more fundamental
connection. Indeed, Ye et al. [2012] showed that in the special case of $F_\beta$, the DTA and EUM optimal
classifiers are asymptotically equivalent as the number of test samples tends to infinity. A similar result
can be shown for any classifier that satisfies the probability ranking principle. The details of the result will
be included in the extended version of this manuscript. For future work, we plan to extend our analysis to
multiclass and multilabel classification, to explore if and when the optimal classifiers take a simple form,
and to design efficient classification algorithms.
A Appendix A

A.1 Proof of Theorem 6

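The proof is by contradiction. Fix a distribution $\mathbb{P} \in \mathcal{P}$, and let $\mathbf{x}$ denote a set of $n$ iid samples from the
marginal $\mathbb{P}_X$. Denote $\mathbb{P}(y_i = 1 | x_i) = \eta_i$ and the optimal classifier by $\mathbf{s}^* \in \{0,1\}^n$. Suppose there exist
indices $j, k$ such that $s^*_j = 1$, $s^*_k = 0$ and $\eta_j < \eta_k$. Let $\mathbf{s}' \in \{0,1\}^n$ be such that $s'_j = 0$ and $s'_k = 1$, but
identical to $\mathbf{s}^*$ otherwise, i.e. $s^*_i = s'_i \ \forall i \in [n] \setminus \{j, k\}$. Note that $\sum_{i=1}^n s^*_i = \sum_{i=1}^n s'_i$.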

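By optimality of $\mathbf{s}^*$ it is clear that,

\[
\mathcal{U}^{\Psi}(\mathbf{s}^*; \mathbb{P}) - \mathcal{U}^{\Psi}(\mathbf{s}'; \mathbb{P}) \geq 0. \tag{8}
\]

Consider the LHS, $\mathcal{U}^{\Psi}(\mathbf{s}^*; \mathbb{P}) - \mathcal{U}^{\Psi}(\mathbf{s}'; \mathbb{P})$ is equal to:

\[
\begin{aligned}
&\sum_{\mathbf{y} \in \{0,1\}^n} \mathbb{P}(\mathbf{y}|\mathbf{x}) \left[\Psi(\mathbf{s}^*, \mathbf{y}) - \Psi(\mathbf{s}', \mathbf{y})\right] \\
&\quad = \sum_{\mathbf{y} \in \{0,1\}^n :\, y_j \neq y_k} \mathbb{P}(\mathbf{y}|\mathbf{x}) \left[\Psi(\mathbf{s}^*, \mathbf{y}) - \Psi(\mathbf{s}', \mathbf{y})\right]
 + \underbrace{\sum_{\mathbf{y} \in \{0,1\}^n :\, y_j = y_k} \mathbb{P}(\mathbf{y}|\mathbf{x}) \left[\Psi(\mathbf{s}^*, \mathbf{y}) - \Psi(\mathbf{s}', \mathbf{y})\right]}_{(*)}
\end{aligned}
\]

Note that when $y_j = y_k$, $\sum_{i=1}^n s^*_i y_i = \sum_{i=1}^n s'_i y_i$, so $\Psi(\mathbf{s}^*, \mathbf{y}) - \Psi(\mathbf{s}', \mathbf{y}) = 0$. It follows that the term
(*) equals 0.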

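Next we apply the representation of Proposition 4 with $v(\mathbf{s}) = \frac{1}{n}\sum_i s_i$ and $p(\mathbf{y}) = \frac{1}{n}\sum_i y_i$. Let
$\mathbf{z} \in \{0,1\}^{n-2}$ denote the vector corresponding to the $n - 2$ indices $\{y_i, i \in [n] \setminus \{j, k\}\}$, then
$\mathcal{U}^{\Psi}(\mathbf{s}^*; \mathbb{P}) - \mathcal{U}^{\Psi}(\mathbf{s}'; \mathbb{P})$ is given by:

\[
\begin{aligned}
&\sum_{\mathbf{y} :\, y_j \neq y_k} \mathbb{P}(\mathbf{y}|\mathbf{x}) \left[\Psi(\mathbf{s}^*, \mathbf{y}) - \Psi(\mathbf{s}', \mathbf{y})\right] = \\
&\sum_{\mathbf{z} \in \{0,1\}^{n-2}} \Big\{ \mathbb{P}(\mathbf{z}, y_j = 1, y_k = 0 | \mathbf{x}) \left[\Phi(\mathrm{TP}(\mathbf{s}^*, \mathbf{y}), v(\mathbf{s}^*), p(\mathbf{y})) - \Phi(\mathrm{TP}(\mathbf{s}', \mathbf{y}), v(\mathbf{s}'), p(\mathbf{y}))\right] \\
&\qquad\quad + \mathbb{P}(\mathbf{z}, y_j = 0, y_k = 1 | \mathbf{x}) \left[\Phi(\mathrm{TP}(\mathbf{s}^*, \mathbf{y}), v(\mathbf{s}^*), p(\mathbf{y})) - \Phi(\mathrm{TP}(\mathbf{s}', \mathbf{y}), v(\mathbf{s}'), p(\mathbf{y}))\right] \Big\}
\end{aligned}
\]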


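Let $\mathbf{s} = \{s^*_i \ \forall i \in [n] \setminus \{j, k\}\}$ and define $\#\mathrm{TP}(\mathbf{z}) := \sum_i s_i z_i$, $\#p(\mathbf{z}) := \sum_i z_i$ (where the # prefix
indicates counts rather than normalized values), and note that $v(\mathbf{s}^*) = v(\mathbf{s}')$. With these substitutions,
$\mathcal{U}^{\Psi}(\mathbf{s}^*; \mathbb{P}) - \mathcal{U}^{\Psi}(\mathbf{s}'; \mathbb{P})$ is given by:

\[
\begin{aligned}
\sum_{\mathbf{z} \in \{0,1\}^{n-2}} \Big\{ &\mathbb{P}(\mathbf{z}, y_j = 1, y_k = 0 | \mathbf{x})
\left[\Phi\!\left(\tfrac{1}{n}(\#\mathrm{TP}(\mathbf{z}) + 1), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right)
 - \Phi\!\left(\tfrac{1}{n}\#\mathrm{TP}(\mathbf{z}), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right)\right] \\
+ &\mathbb{P}(\mathbf{z}, y_j = 0, y_k = 1 | \mathbf{x})
\left[\Phi\!\left(\tfrac{1}{n}\#\mathrm{TP}(\mathbf{z}), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right)
 - \Phi\!\left(\tfrac{1}{n}(\#\mathrm{TP}(\mathbf{z}) + 1), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right)\right] \Big\}
\end{aligned}
\]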


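Next applying the iid assumption on the labels, we have that $\mathbb{P}(\mathbf{z}, y_j, y_k | \mathbf{x}) = \mathbb{P}(\mathbf{z}|\mathbf{x}) \mathbb{P}(y_j|\mathbf{x}) \mathbb{P}(y_k|\mathbf{x})$, so
that the equation further simplifies to:

\[
\begin{aligned}
&\left[\eta_j(1 - \eta_k) - \eta_k(1 - \eta_j)\right] \sum_{\mathbf{z} \in \{0,1\}^{n-2}} \mathbb{P}(\mathbf{z}|\mathbf{x})
\left[\Phi\!\left(\tfrac{1}{n}(\#\mathrm{TP}(\mathbf{z}) + 1), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right)
 - \Phi\!\left(\tfrac{1}{n}\#\mathrm{TP}(\mathbf{z}), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right)\right] \\
&= (\eta_j - \eta_k) \sum_{\mathbf{z} \in \{0,1\}^{n-2}} \mathbb{P}(\mathbf{z}|\mathbf{x})
\left[\Phi\!\left(\tfrac{1}{n}(\#\mathrm{TP}(\mathbf{z}) + 1), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right)
 - \Phi\!\left(\tfrac{1}{n}\#\mathrm{TP}(\mathbf{z}), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right)\right]
\end{aligned}
\]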
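
The collapse of the two probability weights into a single factor can be checked directly. Using $\mathbb{P}(y_j = 1 | \mathbf{x}) = \eta_j$ and $\mathbb{P}(y_k = 1 | \mathbf{x}) = \eta_k$, and noting that the bracket multiplying the second weight is the negative of the bracket multiplying the first:

```latex
\begin{align*}
\mathbb{P}(y_j = 1 | \mathbf{x})\,\mathbb{P}(y_k = 0 | \mathbf{x})
 - \mathbb{P}(y_j = 0 | \mathbf{x})\,\mathbb{P}(y_k = 1 | \mathbf{x})
 &= \eta_j (1 - \eta_k) - (1 - \eta_j)\,\eta_k \\
 &= \eta_j - \eta_j \eta_k - \eta_k + \eta_j \eta_k \\
 &= \eta_j - \eta_k .
\end{align*}
```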


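Note that for each $\mathbf{z} \in \{0,1\}^{n-2}$:

• $\Phi\!\left(\tfrac{1}{n}(\#\mathrm{TP}(\mathbf{z}) + 1), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right)$ can be interpreted as $\Psi$ computed on the vectors $\mathbf{y} \in \{0,1\}^n$
defined as $\{y_i = z_i, i \in [n] \setminus \{j, k\}\} \cup \{y_j = 1\} \cup \{y_k = 0\}$, and $\mathbf{s}^* \in \{0,1\}^n$ (which is the assumed
optimal).

• $\Phi\!\left(\tfrac{1}{n}\#\mathrm{TP}(\mathbf{z}), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right)$ can be interpreted as $\Psi$ computed on the vectors $\mathbf{y} \in \{0,1\}^n$ defined
as above and $\mathbf{s}' \in \{0,1\}^n$.


By TP monotonicity of $\Psi$, for each $\mathbf{z}$, the difference term
$\Phi\!\left(\tfrac{1}{n}(\#\mathrm{TP}(\mathbf{z}) + 1), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right) - \Phi\!\left(\tfrac{1}{n}\#\mathrm{TP}(\mathbf{z}), v(\mathbf{s}'), \tfrac{1}{n}(\#p(\mathbf{z}) + 1)\right) > 0$.
This combined with (8) implies that $\eta_j - \eta_k \geq 0$, which contradicts the assumption $\eta_j < \eta_k$.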
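
The statement just proved can be sanity-checked numerically on a tiny example. The following sketch assumes conditionally independent labels and uses $F_1$ (taken to be 0 when there are no predicted and no actual positives, an assumed convention). It enumerates all 2^n classifiers and all 2^n label vectors, then checks that the maximizer never labels a lower-probability instance positive while leaving a higher-probability one negative. Function names are illustrative only.

```python
from itertools import product


def f1(s, y):
    tp = sum(a * b for a, b in zip(s, y))
    denom = sum(s) + sum(y)
    return 2.0 * tp / denom if denom > 0 else 0.0


def label_prob(y, eta):
    """Probability of label vector y under independent Bernoulli(eta_i) labels."""
    prob = 1.0
    for yi, e in zip(y, eta):
        prob *= e if yi == 1 else 1.0 - e
    return prob


def expected_utility(s, eta, metric=f1):
    """Expected out-of-sample utility of labelling s, by full enumeration of y."""
    return sum(label_prob(y, eta) * metric(s, y)
               for y in product((0, 1), repeat=len(eta)))


def brute_force_optimum(eta, metric=f1):
    """Exhaustive search over all 2^n labellings (feasible only for small n)."""
    return max(product((0, 1), repeat=len(eta)),
               key=lambda s: expected_utility(s, eta, metric))


def respects_ranking(s, eta):
    """True if no instance labelled 1 has strictly smaller eta than one labelled 0."""
    n = len(s)
    return all(not (s[j] == 1 and s[k] == 0 and eta[j] < eta[k])
               for j in range(n) for k in range(n))
```

For example, `respects_ranking(brute_force_optimum([0.2, 0.7, 0.5, 0.9]), [0.2, 0.7, 0.5, 0.9])` should hold, since the exhaustive optimum is a thresholding of the probabilities.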
A.2 Proof of Proposition 11


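Suppose $\Psi$ satisfies TPR/TNR monotonicity. Let $u_1 = \mathrm{TP}(\mathbf{s}_1, \mathbf{y}_1)$ and $u_2 = \mathrm{TP}(\mathbf{s}_2, \mathbf{y}_2)$, $v = v(\mathbf{s}_1) = v(\mathbf{s}_2)$
and $p = p(\mathbf{y}_1) = p(\mathbf{y}_2)$. Note that $\Phi(u_1, v, p) = \Gamma\!\left(\frac{u_1}{p}, \frac{1 - v - p + u_1}{1 - p}, p\right)$ (and similarly equality holds for
$\Phi(u_2, v, p)$). Now, whenever $u_1 = \mathrm{TP}(\mathbf{s}_1, \mathbf{y}_1) > \mathrm{TP}(\mathbf{s}_2, \mathbf{y}_2) = u_2$, $v(\mathbf{s}_1) = v(\mathbf{s}_2) = v$, and $p(\mathbf{y}_1) =
p(\mathbf{y}_2) = p$, we have $\mathrm{TPR}(\mathbf{s}_1, \mathbf{y}_1) > \mathrm{TPR}(\mathbf{s}_2, \mathbf{y}_2)$, $\mathrm{TNR}(\mathbf{s}_1, \mathbf{y}_1) > \mathrm{TNR}(\mathbf{s}_2, \mathbf{y}_2)$, and

\[
\begin{aligned}
\Phi(u_1, v, p) &= \Gamma\!\left(\frac{u_1}{p}, \frac{1 - v - p + u_1}{1 - p}, p\right) \\
&= \Gamma(\mathrm{TPR}(\mathbf{s}_1, \mathbf{y}_1), \mathrm{TNR}(\mathbf{s}_1, \mathbf{y}_1), p) \\
&\overset{(*)}{>} \Gamma(\mathrm{TPR}(\mathbf{s}_2, \mathbf{y}_2), \mathrm{TNR}(\mathbf{s}_2, \mathbf{y}_2), p) \\
&= \Gamma\!\left(\frac{u_2}{p}, \frac{1 - v - p + u_2}{1 - p}, p\right) \\
&= \Phi(u_2, v, p),
\end{aligned}
\]

where (*) follows from TPR/TNR monotonicity of $\Psi$. Thus $\Psi$ satisfies TP monotonicity.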
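
The two rate arguments of $\Gamma$ used above follow from the normalized confusion-matrix identities. As a short worked check, write $u$ for the normalized true positives and assume the four normalized entries TP, FP, FN, TN sum to one, with $v$ the fraction predicted positive and $p$ the fraction of actual positives:

```latex
\begin{align*}
\mathrm{FP} &= v - u, \qquad \mathrm{FN} = p - u, \\
\mathrm{TN} &= 1 - u - (v - u) - (p - u) = 1 - v - p + u, \\
\mathrm{TPR} &= \frac{u}{p}, \qquad
\mathrm{TNR} = \frac{\mathrm{TN}}{1 - p} = \frac{1 - v - p + u}{1 - p}.
\end{align*}
```

With $v$ and $p$ held fixed and $0 < p < 1$, both rates are strictly increasing in $u$, which is the step that lets $u_1 > u_2$ imply the two rate inequalities.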


B Appendix B 

B.1 Proof of Theorem 13

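Let $\mathcal{U}^* := \mathcal{U}^{\Psi}(\mathbf{s}^*; \mathbb{P})$ and let $\hat{\mathcal{U}} := \mathcal{U}^{\Psi}(\hat{\mathbf{s}}; \mathbb{P})$. Also define the empirical distribution:

\[
\hat{\mathbb{P}}(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{n} \hat{\eta}(x_i)^{y_i} (1 - \hat{\eta}(x_i))^{1 - y_i}.
\]

Now consider:

\[
\mathcal{U}^* - \hat{\mathcal{U}} \leq 2 \max_{\mathbf{s}} \left|\mathcal{U}^{\Psi}(\mathbf{s}; \mathbb{P}) - \mathcal{U}^{\Psi}(\mathbf{s}; \hat{\mathbb{P}})\right| \tag{9}
\]

For any fixed $\mathbf{s} \in \{0,1\}^n$, we have:

\[
\begin{aligned}
\left|\mathcal{U}^{\Psi}(\mathbf{s}; \mathbb{P}) - \mathcal{U}^{\Psi}(\mathbf{s}; \hat{\mathbb{P}})\right|
&= \Big| \sum_{\mathbf{y} \in \{0,1\}^n} \mathbb{P}(\mathbf{y}|\mathbf{x}) \Psi(\mathbf{s}, \mathbf{y}) - \sum_{\mathbf{y} \in \{0,1\}^n} \hat{\mathbb{P}}(\mathbf{y}|\mathbf{x}) \Psi(\mathbf{s}, \mathbf{y}) \Big| \\
&\leq \sum_{\mathbf{y} \in \{0,1\}^n} \left|\mathbb{P}(\mathbf{y}|\mathbf{x}) - \hat{\mathbb{P}}(\mathbf{y}|\mathbf{x})\right| \Psi(\mathbf{s}, \mathbf{y})
\end{aligned} \tag{10}
\]

Let $\hat{\eta}(x)$ denote the empirical estimate obtained using $m$ training samples. Now because $\hat{\eta}(x) \xrightarrow{p} \eta(x)$,
we have that for a sufficiently large set of training examples, $\hat{\mathbb{P}}(\mathbf{y}|\mathbf{x}) \xrightarrow{p} \mathbb{P}(\mathbf{y}|\mathbf{x})$; i.e. for any given $\epsilon > 0$,
there exists $m_\epsilon$ such that for all $m > m_\epsilon$, $|\hat{\mathbb{P}}(\mathbf{y}|\mathbf{x}) - \mathbb{P}(\mathbf{y}|\mathbf{x})| < \epsilon$, with high probability. It follows that,
with high probability, (10) $< \epsilon \sum_{\mathbf{y} \in \{0,1\}^n} \Psi(\mathbf{s}, \mathbf{y})$. Assuming $\Psi$ is bounded, we have that for any fixed $\mathbf{s}$,
$|\mathcal{U}^{\Psi}(\mathbf{s}; \mathbb{P}) - \mathcal{U}^{\Psi}(\mathbf{s}; \hat{\mathbb{P}})| < C\epsilon$, for some constant $C$ that depends only on the metric $\Psi$ and (fixed) test set size
$n$. The uniform convergence also follows because the max in (9) is over finitely many vectors $\mathbf{s}$. Putting
together, we have that for any given $\delta, \epsilon' > 0$, there exists a training sample size $m_{\epsilon', \delta}$ such that the output $\hat{\mathbf{s}}$
of our procedure satisfies, with probability at least $1 - \delta$, $\mathcal{U}^* - \hat{\mathcal{U}} < \epsilon'$. The proof is complete.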
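
The factor of 2 in (9) comes from a standard three-term decomposition. The following is a sketch of that step, writing $\mathcal{U}^*$ and $\hat{\mathcal{U}}$ for the expected utilities under $\mathbb{P}$ of the optimal $\mathbf{s}^*$ and of the procedure's output $\hat{\mathbf{s}}$, and assuming that $\hat{\mathbf{s}}$ maximizes the plug-in utility $\mathcal{U}^{\Psi}(\cdot\,; \hat{\mathbb{P}})$, so that the middle bracket is non-positive:

```latex
\begin{align*}
\mathcal{U}^* - \hat{\mathcal{U}}
 &= \left[\mathcal{U}^{\Psi}(\mathbf{s}^*; \mathbb{P}) - \mathcal{U}^{\Psi}(\mathbf{s}^*; \hat{\mathbb{P}})\right]
  + \underbrace{\left[\mathcal{U}^{\Psi}(\mathbf{s}^*; \hat{\mathbb{P}}) - \mathcal{U}^{\Psi}(\hat{\mathbf{s}}; \hat{\mathbb{P}})\right]}_{\leq\, 0}
  + \left[\mathcal{U}^{\Psi}(\hat{\mathbf{s}}; \hat{\mathbb{P}}) - \mathcal{U}^{\Psi}(\hat{\mathbf{s}}; \mathbb{P})\right] \\
 &\leq 2 \max_{\mathbf{s} \in \{0,1\}^n}
   \left|\mathcal{U}^{\Psi}(\mathbf{s}; \mathbb{P}) - \mathcal{U}^{\Psi}(\mathbf{s}; \hat{\mathbb{P}})\right| .
\end{align*}
```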